Sentiment Analysis of New York Times Comments (2017)

Author

Francis Lauder

Published

January 7, 2026

Introduction

Natural language processing (NLP) focuses on using computers to analyze and understand text by converting human language into a format that machines can process. NLP encompasses many techniques, such as text classification and topic modeling. For our project we’ll be using sentiment analysis, which determines the opinion an author expresses in a piece of text; the text can be categorized as positive, negative, or neutral. We’ll be analyzing comments from the New York Times for 2017, covering the months January through April. The datasets used for our analysis can be found on the Kaggle website at the following link: New York Times Comments

Code used for our analysis is hidden and can be viewed by expanding the “Show Code” option.

Data Inspection

Libraries

Show Code
library(dplyr)
library(ggplot2)
library(treemapify)
library(plotly)
library(tidyr)
library(tidytext)
library(SnowballC)
library(syuzhet)
library(furrr)
library(quarto)

Data Review

Show Code
nrow(nyt_comments17)
[1] 969655
Show Code
nyt_comments17 |> glimpse()
Rows: 969,655
Columns: 5
$ commentBody    <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID      <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ commentType    <chr> "comment", "comment", "comment", "comment", "comment", …
$ newDesk        <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…

We see from the above outputs that our dataset has just under 970,000 rows. We will not require the commentType column, so we will remove it.

Show Code
nyt_comments17_short<-nyt_comments17 |> 
    select(-commentType)
Show Code
nyt_comments17_short |> 
    glimpse()
Rows: 969,655
Columns: 4
$ commentBody    <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID      <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ newDesk        <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…

Data Preparation

Now that our data is selected, we will prepare it for analysis. First, a function will be created to clean the text in the commentBody column. Cleaning includes removing HTML/Markdown tags and backslashes, stripping all characters other than letters, numbers, whitespace, exclamation marks, question marks, and emoji, and replacing multiple spaces with a single space.

Show Code
clean_comment_text <- function(text) {
  text %>%
    gsub("<.*?>", " ", .) %>%                  # Remove HTML/Markdown tags like <br/>
    gsub("\\\\", " ", .) %>%                   # Remove backslashes
    gsub("[^\\p{L}\\p{N}\\s!?\\p{Emoji_Presentation}]", " ", ., perl = TRUE) %>%  # Keep only letters, numbers, whitespace, !, ?, and emoji
    gsub("\\s+", " ", .) %>%                   # Replace multiple spaces with single space
    trimws()
}
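As a quick sanity check, we can run the function on a made-up comment (a hypothetical example, not a row from the dataset):

```r
# Hypothetical input: the HTML tag, backslash, and comma are stripped,
# while "!" and "?" survive and extra spaces collapse to one
clean_comment_text("Great <br/>article!!  \\ Loved it, truly?")
```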
Show Code
nyt_comments_clean<-nyt_comments17_short |> 
    mutate(clean_comments = clean_comment_text(commentBody))
Show Code
nyt_comments_clean |> 
    glimpse()
Rows: 969,655
Columns: 5
$ commentBody    <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID      <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ newDesk        <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…
$ clean_comments <chr> "This project makes me happy to be a 30 year Times subs…

Sentiment Extraction

Show Code
nrow(nyt_comments_clean)
[1] 969655

Because the dataset is relatively large at 969,655 rows, we will use parallel processing for efficiency and speed during sentiment extraction. To accomplish this we will use the future_map_dbl function from the furrr library, which distributes work across CPU cores so that tasks run in parallel. All but one CPU core will be used for our processing.

Show Code
plan(multisession, workers = parallel::detectCores() - 1)
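With this plan in place, subsequent future_map_dbl calls split their input across the workers. A minimal sketch with toy strings (not our data) illustrates the call shape we will use below:

```r
# Each element is scored on whichever worker picks it up;
# results come back as a numeric vector in the original order
future_map_dbl(c("I love this", "I hate this"), get_sentiment, method = "afinn")
```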

For our sentiment analysis we will need to choose a sentiment lexicon. A lexicon is essentially a sentiment dictionary that maps words to the sentiment they have been tagged with, such as negative or positive. We’ll be using the AFINN lexicon, in which, instead of each word being tagged as simply negative or positive, words are given a numeric score from -5 to 5: a negative score indicates negative sentiment, a positive score indicates positive sentiment, and zero indicates neutral sentiment. A comment’s overall score is computed by summing the scores (weights) of its sentiment words; because the word scores are summed, a comment’s total can fall outside the -5 to 5 range. From this calculated score we can assign a category to each row of text.
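To make the summing concrete, here is a hypothetical one-line example (the words are chosen for illustration; their exact values come from the lexicon itself):

```r
# If "happy" scores +3 and "abandoned" scores -2 in AFINN,
# the comment sums to +1; unscored words like "but" contribute nothing
get_sentiment("happy but abandoned", method = "afinn")
```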

Show Code
get_sentiments("afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

The first ten rows of the AFINN lexicon are shown above. The first column holds the word and the second holds its sentiment score.

Show Code
comments17_sentiment<-nyt_comments_clean |> 
  mutate(
      sentiment_score=future_map_dbl(clean_comments, get_sentiment, method = "afinn")
  )
Show Code
comments17_sentiment |> glimpse()
Rows: 969,655
Columns: 6
$ commentBody     <chr> "This project makes me happy to be a 30+ year Times su…
$ commentID       <int> 22022598, 22017350, 22017334, 22015913, 22015466, 2201…
$ newDesk         <chr> "Insider", "Insider", "Insider", "Insider", "Insider",…
$ typeOfMaterial  <chr> "News", "News", "News", "News", "News", "News", "News"…
$ clean_comments  <chr> "This project makes me happy to be a 30 year Times sub…
$ sentiment_score <dbl> 5, -2, 8, 0, 4, 5, 4, 13, 3, 0, 4, -3, -5, 0, 1, -5, 7…
Show Code
comments17_sentiment |> 
  select(commentID, sentiment_score) |> 
  head(30)
   commentID sentiment_score
1   22022598               5
2   22017350              -2
3   22017334               8
4   22015913               0
5   22015466               4
6   22012085               5
7   22003784               4
8   22024897              13
9   22082978               3
10  22004930               0
11  22005135               4
12  22004841              -3
13  22005149              -5
14  22004746               0
15  22005218               1
16  22005228              -5
17  22004632               7
18  22004617               5
19  22004589              -2
20  22004546               4
21  22004815              10
22  22004629               3
23  22005189              -2
24  22005185               5
25  22004932               0
26  22004805               0
27  22004566              -1
28  22004431              -5
29  22004413               1
30  22004294               0

Using the glimpse function we now see a new column named “sentiment_score”. Viewing the first thirty rows confirms that a score has been assigned to each comment.

Now that we have sentiment scores, our next step is to assign a sentiment label based on them. Negative scores will be assigned a sentiment of “negative”, positive scores will be assigned “positive”, and a score of zero will be assigned “neutral”. This will be accomplished using the case_when function from the dplyr library.

Show Code
comments17_sentiment<-comments17_sentiment %>% 
mutate(sentiment=case_when(
  sentiment_score > 0 ~ "positive",
  sentiment_score < 0 ~ "negative",
  TRUE ~ "neutral"
))
Show Code
comments17_sentiment |> 
  glimpse()
Rows: 969,655
Columns: 7
$ commentBody     <chr> "This project makes me happy to be a 30+ year Times su…
$ commentID       <int> 22022598, 22017350, 22017334, 22015913, 22015466, 2201…
$ newDesk         <chr> "Insider", "Insider", "Insider", "Insider", "Insider",…
$ typeOfMaterial  <chr> "News", "News", "News", "News", "News", "News", "News"…
$ clean_comments  <chr> "This project makes me happy to be a 30 year Times sub…
$ sentiment_score <dbl> 5, -2, 8, 0, 4, 5, 4, 13, 3, 0, 4, -3, -5, 0, 1, -5, 7…
$ sentiment       <chr> "positive", "negative", "positive", "neutral", "positi…
Show Code
comments17_sentiment |> 
  select(commentID, sentiment) |> 
  head(30)
   commentID sentiment
1   22022598  positive
2   22017350  negative
3   22017334  positive
4   22015913   neutral
5   22015466  positive
6   22012085  positive
7   22003784  positive
8   22024897  positive
9   22082978  positive
10  22004930   neutral
11  22005135  positive
12  22004841  negative
13  22005149  negative
14  22004746   neutral
15  22005218  positive
16  22005228  negative
17  22004632  positive
18  22004617  positive
19  22004589  negative
20  22004546  positive
21  22004815  positive
22  22004629  positive
23  22005189  negative
24  22005185  positive
25  22004932   neutral
26  22004805   neutral
27  22004566  negative
28  22004431  negative
29  22004413  positive
30  22004294   neutral

Our data now has a “sentiment” column reflecting negative, positive, or neutral, based on the sentiment_score column.

Sentiment Review

Visualization can help us view the percent breakdown of sentiment and compare sentiment by type of material and news desk.

Show Code
sentiment_tree<-comments17_sentiment |> 
count(sentiment) |> 
mutate(perc = round(n/sum(n),3)*100)
Show Code
tree_map<-ggplot(sentiment_tree, aes(area=perc, fill = sentiment, label=perc))+
geom_treemap()+
geom_treemap_text()+
ggtitle("Sentiment (Percent)")+
theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust=0.5))+
scale_fill_discrete(name = "Sentiment")

Sentiment Tree Map

From the tree map we find that positive and negative sentiments are virtually equal at 42.2% and 42.8%, respectively.

Show Code
dot_cnt<-ggplot(comments17_sentiment, aes(x=sentiment,y=typeOfMaterial))+
geom_count(aes(colour=sentiment))
Show Code
gg_dot_cnt<-ggplotly(dot_cnt)


Sentiment by Type of Material

Editorial has the greatest difference between negative and positive sentiment.

Show Code
tile_cnt<-comments17_sentiment |> 
group_by(newDesk) |> 
count(sentiment, newDesk) |> 
mutate(perc=round(n/sum(n),2)*100)
Show Code
g_tile<-tile_cnt |> 
ggplot(aes(x=sentiment,y=newDesk))+
    geom_tile(aes(fill=perc))+
    labs(fill = "Percent")+
    xlab("Sentiment")+
    theme(axis.title.y=element_blank())
Show Code
gg_gtile<-ggplotly(g_tile)

Sentiment by News Desk